{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "ae69ce68",
   "metadata": {},
   "source": [
    "## 1. PLSC-ViT模型简介\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "35485bc6",
   "metadata": {},
   "source": [
    "PLSC-ViT实现了基于Transformer的视觉分类模型。ViT对图像进行切分成patch，之后基于patch拉平的sequence进行线性embedding，并且添加了position embeddings和classfication token，然后将patch序列输入到标准的transformer编码器，最终经过一个MLP进行分类。模型结构如下，\n",
    "\n",
    "![Figure 1 from paper](https://github.com/google-research/vision_transformer/raw/main/vit_figure.png)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "97e174e6",
   "metadata": {},
   "source": [
    "## 2. 模型效果 "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "78137a72",
   "metadata": {},
   "source": [
    "| Model | Phase | Dataset | gpu | img/sec | Top1 Acc | Official |\n",
    "| --- | --- | --- | --- | --- | --- | --- |\n",
    "| ViT-B_16_224 |pretrain  |ImageNet2012  |A100*N1C8  |  3583| 0.75196 | 0.7479 |\n",
    "| ViT-B_16_384 |finetune  | ImageNet2012 | A100*N1C8 | 719 | 0.77972 | 0.7791 |\n",
    "| ViT-L_16_224 | pretrain | ImageNet21K | A100*N4C32 | 5256 | - | - |  |\n",
    "|ViT-L_16_384  |finetune  | ImageNet2012 | A100*N4C32 | 934 | 0.85030 | 0.8505 |"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ace3c48d",
   "metadata": {},
   "source": [
    "## 3. 模型如何使用"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a97a5f56",
   "metadata": {},
   "source": [
    "### 3.1 安装PLSC"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "492fa769-2fe0-4220-b6d9-bbc32f8cca10",
   "metadata": {},
   "source": [
    "```\n",
    "git clone https://github.com/PaddlePaddle/PLSC.git\n",
    "cd /path/to/PLSC/\n",
    "# [optional] pip install -r requirements.txt\n",
    "python setup.py develop\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6b22824d",
   "metadata": {},
   "source": [
    "### 3.2 模型训练"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d68ca5fb",
   "metadata": {},
   "source": [
    "1. 进入任务目录\n",
    "\n",
    "```\n",
    "cd task/classification/vit\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9048df01",
   "metadata": {},
   "source": [
    "2. 准备数据\n",
    "\n",
    "将数据整理成以下格式：\n",
    "```text\n",
    "dataset/\n",
    "└── ILSVRC2012\n",
    "    ├── train\n",
    "    ├── val\n",
    "    ├── train_list.txt\n",
    "    └── val_list.txt\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bea743ea",
   "metadata": {},
   "source": [
    "3. 执行训练命令\n",
    "\n",
    "```shell\n",
    "export PADDLE_NNODES=1\n",
    "export PADDLE_MASTER=\"127.0.0.1:12538\"\n",
    "export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7\n",
    "\n",
    "python -m paddle.distributed.launch \\\n",
    "    --nnodes=$PADDLE_NNODES \\\n",
    "    --master=$PADDLE_MASTER \\\n",
    "    --devices=$CUDA_VISIBLE_DEVICES \\\n",
    "    plsc-train \\\n",
    "    -c ./configs/ViT_base_patch16_224_in1k_1n8c_dp_fp16o2.yaml\n",
    "```\n",
    "\n",
    "更多模型的训练教程可参考文档：[ViT训练文档](https://github.com/PaddlePaddle/PLSC/blob/master/task/classification/vit/README.md)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "186a0c17",
   "metadata": {},
   "source": [
    "### 3.3 模型推理"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e97c527c",
   "metadata": {},
   "source": [
    "1. 下载预训练模型\n",
    "\n",
    "```shell\n",
    "mkdir -p pretrained/vit/ViT_base_patch16_224/\n",
    "wget -O ./pretrained/vit/ViT_base_patch16_224/imagenet2012-ViT-B_16-224.pdparams https://plsc.bj.bcebos.com/models/vit/v2.4/imagenet2012-ViT-B_16-224.pdparams\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a07c6549",
   "metadata": {},
   "source": [
    "2. 导出推理模型\n",
    "\n",
    "```shell\n",
    "plsc-export -c ./configs/ViT_base_patch16_224_in1k_1n8c_dp_fp16o2.yaml -o Global.pretrained_model=./pretrained/vit/ViT_base_patch16_224/imagenet2012-ViT-B_16-224 -o Model.data_format=NCHW -o FP16.level=O0\n",
    "```\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d375934d",
   "metadata": {},
   "source": [
    "## 4. 相关论文及引用信息\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "29f05b07-d323-45e4-b00d-0728eafb5af7",
   "metadata": {},
   "source": [
    "```text\n",
    "@article{dosovitskiy2020,\n",
    "  title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},\n",
    "  author={Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and  Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and Uszkoreit, Jakob and Houlsby, Neil},\n",
    "  journal={arXiv preprint arXiv:2010.11929},\n",
    "  year={2020}\n",
    "}\n",
    "```"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
